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Abstract 

For the challenging task of modeling multivariate 
time series, we propose a new class of models 
that use dependent Matern processes to capture 
the underlying structure of data, explain their in¬ 
terdependencies, and predict their unknown val¬ 
ues. Although similar models have been pro¬ 
posed in the econometric, statistics, and machine 
learning literature, our approach has several ad¬ 
vantages that distinguish it from existing meth¬ 
ods: 1) it is flexible to provide high prediction 
accuracy, yet its complexity is controlled to avoid 
overhtting; 2) its interpretability separates it from 
black-box methods; 3) finally, its computational 
efficiency makes it scalable for high-dimensional 
time series. In this paper, we use several simu¬ 
lated and real data sets to illustrate these advan¬ 
tages. We will also briefly discuss some exten¬ 
sions of our model. 

1. Introduction 

Developing powerful models that can capture the dynamics 
of multivariate time series data, in order to explain their 
dependencies and predict their unknown values, remains 
a difficult task in statistics and machine learning. A key 
challenge is to answer: 

Question. How can we describe correlations among mul¬ 
tiple time series 

Xi{t),X2{t),...,Xp{t), (1) 

in a way that is also useful for prediction? 

In this paper, we tackle this issue by proposing a special 
case of multivariate Gaussian processes that we call De¬ 
pendent Matern Processes (DMP). Similar models have 
been previously proposed in the econometrics, statistics, 
and machine learning literature. Here, we follow the re¬ 
cent work of (Sarkka et al., 2013) in considering Gaus¬ 
sian processes from the viewpoint of stochastic differential 
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equations, and attempt to elucidate the mathematical un¬ 
derpinnings of this approach. Despite the similarity to sev¬ 
eral existing methods, our focus is on constructing a more 
interpretable model that can explain dependencies among 
multiple time series, but without sacrificing flexibility or 
scalability. 

This paper is organized as follows. We discuss univariate 
Gaussian process (GP) models in Section 2, and briefly re¬ 
view several methods to generate multivariate GP. In Sec¬ 
tion 3, we present our proposed method. Dependent Matern 
Processes. This is a special case of multivariate GP with 
properties that make it a powerful alternative to existing 
methods. Section 4 shows how univariate GPs can be in 
fact presented either as infinite dimensional functions or as 
solutions to a specific class of stochastic differential equa¬ 
tions. Following (Sarkka et al., 2013), we show how these 
two alternative representations are connected, and use this 
insight to develop the inferential framework of our DMP 
model in Section 5. In Section 6, we present several experi¬ 
ments to illustrate the advantages of our method. Finally, in 
Section 7, we discuss possible extensions of this approach. 

2. Preliminaries 

Throughout this paper we will stay within the framework 
of Gaussian processes (GP). In this section, we discuss uni¬ 
variate and multivariate GP. We represent scalar quantities 
with lower-case and use capital letters to represent vectors. 
Boldface capital letters represent matrices. 

2.1. Univariate Gaussian Processes 

A Gaussian process (GP) on the real line is a random 
real-valued function x{f), with statistics completely deter¬ 
mined by its mean function Ea;(s) and kernel K{s,t) = 
Cov{x{s),x{t)). More precisely, all finite-dimensional 
distributions (a;(fi),... ,x(tn)) are multivariate Gaussian 
with mean (Ea:(fi),..., Ea;(fn)), and with covariance ma¬ 
trix (K(tk,te))'^ Since the latter must be positive semi- 
definite for every finite collection of inputs U,..., only 
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certain kernels k are valid - that is, define Gaussian pro¬ 
cesses. Thus when using Gaussian processes, a practi¬ 
tioner often chooses from among the few popular classes 
of kernels, such as the Squared Exponential (SE), Ornstein- 
Uhlenbeck (OU), Matern, Polynomial, and linear combina¬ 
tions of these. 

In general, the choice of kernel encodes our qualitative be¬ 
liefs about the underlying signal. Eor instance, the OU ker¬ 
nel produces non-differentiable functions x(f), while the 
SE kernel is infinitely differentiable. In this paper we will 
concentrate on the Matern class of kernels, which have 
as hyper-parameters the smoothness v, length-scale I, and 
variance In particular, for n = 0,1,... and v = ^ +n it 
is known that realizations x{t) of a GP with Matern kernel 
are n times continously differentiable, while k(s, t) decays 
at rate as t — s becomes large (Stein, 1999; 

Rasmussen & Williams, 2006). It is then straightforward 
to add extra observation noise reflecting our uncertainty in 
the accuracy of our measurements. 

2.2. Multivariate GPs 

There often arise situations where we would like to jointly 
model several time series xi{t),X 2 {t ),... ,Xp{t), for the 
purpose of inference, in particular attempting to quantify 
the dependencies between the observed series, and/or to 
improve our predictions of one series using data from the 
others. In the context of Gaussian processes, we intend 
that for different processes i and j we have a non-zero co- 
variance. In fact, a multi-output or multivariate Gaussian 
process can be defined just as in Section 2.1, but where the 
kernel function now depends on two pairs of inputs. Eor 
simplicity we will assume in what follows that the mean of 
each time series is the zero function. The kernel k is now 
defined for = and s,t gM. as 

K{[i,s],[j,t]) =Exi{s)xj{t). (2) 

The initial challenge within the Gaussian process context is 
to produce a valid and interpretable kernel. 

2.2.1. Linear models 

The usual technique for generating multivariate GP ker¬ 
nels is known as co-kriging from the geostatistical litera¬ 
ture (Cressie, 1993). The simplest case is known as the 
intrinsic co-regionalization model (ICM), where one takes 
C = {cij) to be a positive definite p x p matrix, (s, t) 
to be a valid univariate kernel, and defines the multi-output 
kernel k to be their product 

Exi{s)xj{t) = CijK‘'^\s,t). (3) 

Notice that while the intrinsic co-regionalization model is 
easily interpretable - the single matrix C provides the co- 
variances between the time series - all outputs share the 


same univariate kernel, which makes for a rather inflexible 
model. 

The linear model of coregionalization (LCM) adds more 
flexibility by allowing linear combinations of ICM’s, re¬ 
sulting in a kernel of the form 

Ex*(s)Xj(t) (4) 

k^l 

Eor each k = 1,... ,q, t) is assumed to be a valid 

kernel for a univariate GP, and is assumed 

to be a positive definite matrix. It is not hard to see that (4) 
results in a valid kernel. However, we have now lost some 
the interpretability of the ICM. More problematically, the 
LCM still does not provide a notion of correlation between 
processes with differing length-scales. One proposed solu¬ 
tion is the process convolution approach (Boyle & Erean, 
2005; Alvarez & Lawrence, 2011), which allows for qual¬ 
itatively very different processes to be correlated, though 
with some loss of interpretability. 

2.2.2. Latent models 

Another approach is to describe {xi{t),... ,Xp{t)) as 
linear combinations of latent factors. We suppose 
ui{t),..., Uqif) are independent mean zero Gaussian pro¬ 
cesses, and let 

Ti (f) — ^ ^ (/), foi % — 1, 2,... p. (5) 

fc=i 

Let Ki{s,t) = Eui{s)ui{t) be the kernel for the i’th 
latent process. Then the observed processes x(f) = 
{xi{t),... ,Xp{f)) are jointly mean-zero Gaussian with co- 
variances 

<? 

E.Xi{s)xj{t) = ( 6 ) 

k^l 

This is the semi-parametric latent factor model of (Teh 
et al., 2005), so-called because the linear combination of 
latent GP’s is parameterized by the matrix of coefficients 
A = {ai^k), while each Gaussian process is of course a 
non-parametric model. However, this latent model (5) is 
actually an example of the above linear model of coregion¬ 
alization, where is just the outer product of the vector 
a. fc with itself. 

See (Alvarez et al., 2011) for a nice survey of these and 
other variants of co-kriging used in the machine learning 
literature. 

2.2.3. Other approaches 

Instead of trying to create multivariate kernels in a general 
fashion, one can attempt multivariate generalizations of a 
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given class of univarate kernels, often by using Bochner’s 
theorem (see Section 4.1 below). The recent work of 
(Gneiting et al., 2010; Apanasovich et al., 2012) is perhaps 
the most relevant to our model, as they show how to con¬ 
struct a family of valid kernels for multivariate Gaussian 
processes on where the marginal processes each have 
Matern kernel with different hyperparameters. 

3. Dependent Matern processes 

From a modeling perspective we would like to describe cor¬ 
relations between processes that have different (unique) hy¬ 
perparameters, whereas in most of the above models this is 
only roughly attained by taking linear combinations of pro¬ 
cesses. 

3.1. Our approach 

We will model multivariate time series X{t) = 
{xi{t),... ,Xp{t)) such that each marginal process Xi{t) is 
a stationary mean-zero Gaussian process with Matern ker¬ 
nel 

thus the processes are allowed different length scales and 
variance, while sharing a common smoothness. In what 
follows we will always assume n = | to be an integer. 

As we will explain in Section 4.3, each Xj{t) can actually 
be represented as a solution of the stochastic differential 
equation 

^ j Xj (t) = aj Wj (t) , (7) 

where w{t) is white noise and is a constant. 

3.1.1. A NEW MULTI-OUTPUT GP 

To introduce dependence among the Matern processes 
Xj(t) we correlate the input noises ajWj{t) in (7). That 
is, we let L be a p X R matrix and set 

... ,Wp{t))'^ = diag{a^^,...,ap^)LV{t), ( 8 ) 

where ) is a vector of R independent standard Brownian 
motions, which we can think of as latent noise processes. 
Note that L has absorbed the aj parameters, and (c^) = 
C = LL^ is the covariance matrix of the input noises. 

The stationary solution of these coupled SDEs results in 
multi-output GP, which we will refer to as a Dependent 
Matern process. 

In Section 5 we will show how to compute the kernel (2) 
for this new process, resulting in (for u = ^) 

Ex ^{ s ) xj { t ) oc (9) 


while for z/ = | we obtain 

2 + {t-s)(^ + ^) 

oc c.,4- ^ ■ (10) 

In both cases we are assuming s < t , and the factor 

rij = 2./IJ~/ie, + ej) ( 11 ) 

is the ratio of the geometric and arithmetic means of the 
two length-scales. 

Examining these two expressions for the kernel, one should 
note; 

1. They are not symmetric in time, as interchanging s 
and t would replace ij with £i in the exponential. That 
is, the covariance kernel respects the forward flow of 
time, which we believe to be a desired characteristic. 
Note this feature is missing from all of the models dis¬ 
cussed above. 

2. Eor £i « £j the factor is close to 1, but as £i and 
£j increasingly differ in scale goes to zero. In¬ 
tuitively this means that two processes with different 
length scales cannot move tightly together. 

3.1.2. Defining the correlation 

Even if the various length scales are quite different, the ma¬ 
trix C = (cij), which we can recover from observed data, 
can be normalized in the usual way to obtain a clear, though 
model-dependent, notion of correlation {pij ) between time 
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3.1.3. Latent force models 

Our proposed model can be thought of as a particular case 
of latent force models (Alvarez et al., 2009), although our 
motivation and approach to inference are very different. 
With a latent force model one thinks of each output time 
series as following specific physical dynamics, such as a 
damped harmonic oscillator, but that is also under the influ¬ 
ence of latent forces (modelled as GPs), which are shared 
across the outputs as we do in (8). 

With our model we are more interested in providing a in¬ 
terpretable notion of correlation between the time series, 
and do not assume knowledge of any underlying physi¬ 
cal dynamics for the outputs. We are instead interested in 
the qualitative features of the Matern class, and following, 
e.g. (Hartikainen & Sarkka, 2010; Mbalawata et al., 2013; 
Sarkka et al., 2013) we construct the SDE dynamics that 
represent such processes. 
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3.2. Computational complexity 

Gaussian processes in general suffer from the big-N prob¬ 
lem, that is, computations involving a N samples from a 
Gaussian process typically are of cubic complexity in N, 
since one usually needs to invert the N x N covariance 
matrix In the case of p processes sampled N 

times, the resulting computational cost is 0{N^p^), which 
can be already prohibitive when there are only a few hun¬ 
dred samples. 

In special cases such as equally-spaced observations there 
are faster techniques such as circulant embedding (Diet- 
rich & Newsam, 1997), and for general Gaussian processes 
there has been a recent flurry of research into sparse ap¬ 
proximations (Quinonero-Candela & Rasmussen, 2005). 

As we will see in Section 4.2.2, stochastic differential equa¬ 
tions can be transformed into state space models, which 
have the nice feature that computing the likelihood of 
N observations, or using the Kalman filter and Rauch- 
Tung-Streibel smoother for prediction, only has complex¬ 
ity 0{p^N). This allows our model to easily handle data 
containing thousands of observations. 

In order to transform our DMP model into a state space 
model we first need to set up the mathematical background 
connecting Gaussian processes and stochastic differential 
equations. However, the reader might prefer to jump to 
Section 5.1, where we show how to use the state space form 
for inference and prediction. 

4. Two approaches to univariate Gaussian 
processes 

4.1. Infinite dimensional regression 

One way of viewing a Gaussian process is as a random 
function of the form 

OO 

x{t) ='^aki>k{t), (13) 

fc =0 

where {'0fc(i)} is a collection of deterministic square inte- 
grable (L^) functions, that is, features, and we place an iid 
N{0, 1) prior on the coefficients a^. As a linear combina¬ 
tion of Gaussians is Gaussian, x{t) is clearly a GR 

To compute the kernel of (13), we take any orthonormal ba¬ 
sis {(pnit)} of L^, let g be an integrable function, and de¬ 
fine 'ipkit) to be the convolution J (j)k{u)g{t — u)du, which 
should be thought of as the Lf inner product of (j)]. and 
g{t — •). Since Ea„am = 5n,m, the kernel Ea:(s)x(f) re¬ 
duces to 

OO p 

^ = / g{s - u)g{t - u)du. (14) 


The last equality is just Parsival’s identity, relating the inner 
product of g{s — •) and g{t — •) to the dot product of their 
coefficient vectors {ipkis)} and {tpk{t)} in the orthonormal 
basis. The key point of (14) - similar to the kernel trick for 
support vector machines - is that this kernel is independent 
of the choice of orthonormal basis, and so the function g 
now defines the Gaussian process. 

By using the Fourier transform we can characterize the 
class of valid kernels. Given a stationary kernel (k(s, t) = 
k(0, t — s)), its spectral density S{^) is defined by: 

r,{0,t)= J (15) 

Noting that the right hand side of (14) describes a stationary 
kernel, we use Parsival’s identity again to see that 

J g{t-u)g{-u) = J e^*^\g(i)\^d^, (16) 

with g the Fourier transform of p. In particular, a function 
K{t) with non-negative Fourier transform (spectral density) 
is a valid kernel for a stationary GP; the precise equiva¬ 
lence, known as Bochner’s theorem (Stein, 1999), shows 
that all valid kernels arise in this fashion. 

4.2. Stochastic differential equations 

A particularly nice way to construct Gaussian processes 
on the real line is via solutions of stochastic differential 
equations (SDE’s). Although constructing Gaussian pro¬ 
cesses through SDE’s goes back to the seminal article 
of (Doob, 1944), and has been used extensively in econo¬ 
metrics (Bergstrom, 1990), it has only recently seen devel¬ 
opment in the machine learning literature (Hartikainen & 
Sarkka, 2010; Hartikainen et al., 2012; Mbalawata et al., 
2013; Sarkka et al., 2013; Reece et al., 2014). 

The archtypical SDE is the Ornstein-Uhlenbeck process 

n T 

— (t) = ax{t) (17) 

at 

where w{t) is Gaussian white noise. One can make mathe¬ 
matical sense of this via its integrated form 

x{t) — x{s) = / ax{u)du + w{t) — w{s), (18) 

J S 

with w{t) as the Weiner process (Brownian motion). Given 
an initial value a:(s), it has the solution 

a:(f) = -f / t>s. (19) 

J S 

Although it is sometimes thought that making sense of the 
integral in (19) requires the full weight of Ito calculus, for 
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deterministic (and differentiable) integrands we can use the 
integration by parts formula: /* f{u)dw{u) = w{t)f{t) — 
wis)f{s) - f* f(u)w(u)dud 


4.2.1. General case 


Higher order SDK’s of the form 


d" d^~^ 

—+ + - • • + aoa:(f) = crw(t), (20) 


such as the one defining the Matern process (7), are simi¬ 
larly interpreted. Letting F(t) be the vector of derivatives 
(x(t),x'(t),... we can rewrite (20) as 


dt 



With Q as the nxn matrix above and E = (0,..., 0, cr)^, 
the solution to (20) is completely analogous to (19): 


F{t) = -b 



e(*-^)^Edw{u). 


( 22 ) 


4.3. Connecting SDEs to GPs 

Unfortunately not all Gaussian processes on K exactly cor¬ 
respond to an SDK. The precise relationship is due to 
(Doob, 1944): A stationary Gaussian process on K can be 
represented as the stationary solution of (20) when its spec¬ 
tral density has the form 


-b 1 -b-b ai(j^) -b oop' 

(28) 

The fundamental example is the Matern class of Gaussian 
processes, which have a kernel with spectral density 

SiO = (e + p) , (29) 

where v, i, and tr are the smoothness, lengthscale, and 
variance parameters, respectively, and is a constant 
with respect to ^ and a. When ix = n + ^ we can fac¬ 
tor 5'(^) = giOdiO’ where^ 

giO = j ■ (30) 


4.2.2. Stationarity 

We now require that the eigenvalues of Q, that is, the zeros 
of its characteristic polynomial 

x”-b H--boo, (23) 

all have negative real part. In this case, taking the limit of 
(22) ast —>■ oo results in a zero mean Gaussian random vec¬ 
tor with a finite covariance matrix we denote by Eoo- If we 
then choose some initial point F{0) ^ JV{0, Sqo), the re¬ 
sulting process {E{t); f > 0) is a stationary n-dimensional 
Gaussian Markov process, with covariance kernel 

EE{s)E{tf = (24) 

The integral in (22) is also Gaussian with covariance 

(25) 

Usually we only observe the positions x{t) at a finite col¬ 
lection of times ti,..., In- Assuming corruption by obser¬ 
vation noise e^, the resulting observations of the SDK (20) 
can be written in the following state space form: 

Fitk) = + gk, (26) 

y{tk) = HE{tk) + Ck, (27) 

where {gk} are independent Gaussian with covariance 
(25), and FI = (1,0,..., 0) is the observation matrix. 

*In this interpretation only an interchange of integrals is re¬ 
quired to show that (19) solves (18). 


Hence such Matern class GP’s can be realized as solutions 
of the SDK (7) used in our multivariate GP. 

In order to put (7) into the state space form (26), we expand 
out its left hand side using the binomial theorem to obtain 
the n -b 1 X n -b 1 matrix Q in (21). With n = 0 or 1 
(corresponding to i/ = 1/2 or 3/2), we have 


Qj = (l/ij), orQj 



(31) 


Furthermore, each matrix exponential can be computed an¬ 
alytically (Jones, 1981). With n = 1, for example, we have 


— g ty/ajij 





(32) 


Although we will not make use of it here, one should note 
that by approximating g (16) with rational functions, one 
can approximately represent other Gaussian processes in 
terms of SDEs (Sarkka et al., 2013; Solin & Sarkka, 2014) 


5. Inference in the dependent Matern model 

To obtain the joint state space representation of the p 
coupled SDK’s (7), we stack the p derivative vectors 

^There are multiple choices for g{^), however, only this one 
ensures that the zeros of l/g{ix) (that is, the polynomial (23)) all 
have negative real part, and thus corresponds to a stationary SDE 
as discussed in Section 4.2.2. 
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Fi,... ,Fp together to create the length p{n + 1) vector 

F{t) = ..., ..., Xp{t),.. .,x^^\t))'^, 

(33) 

containing the p processes and their first n derivatives. 

Recalling (26) and (27), write E = 0 Ip, H = 0 Ip, 

where Ip is the p x p identity matrix and A (g) B is the 
Kronecker product of the matrices A and B. In partic¬ 
ular, HE(f) = {xi(t),... ,Xp{t)). Then with Q as the 
block diagonal matrix with blocks Qi,..., Qp as in (31), 
the equivalent state space formulation of the coupled SDE’s 
(7) can be written as 

Fitk) = + rjk (34) 

y(4) = HF(4) + el.. (35) 

Note that the matrix exponential is just a block diagonal 
matrix with blocks . The observation noise ejj is 

assumed to be iid mean-zero Gaussian with diagonal co- 
variance diag{Ti,... ,Tp). And finally the process noise 
fjk is given by (25). We omit the calculation of the needed 
stationary covariance Ego of F{t), which is a block matrix 
with the i, j-block wn + lxn + \ matrix 

if Tl — 0, (36) 

where was defined in (11), and when n = 1: 

^ (37) 

tdi / 

Finally, by substituting (36) and (37) into (24), we can ob¬ 
tain the covariances (9) and (10). 

5.1. Applying the Kalman filter and smoother 

Given a state space model 

Zk = AkZk-i + rjk, 

Vk ~ F ^k: 

and observed data yi,... ,yiy, with rjk and as indepen¬ 
dent Gaussian noise, the Kalman filter (see Murphy (2012) 
for example) recursively calculates the conditional means 
and covariances 

='E{zk\yi,...,yk-i,<d) (38) 

= E ({zk - m~){zk - TOj:)^|t/i,... ,2/fc-i, 0) . 

(39) 

We use 0 to denote the collected parameters for Ak,rik, 
and Efc. Setting Sk = HkPk + Jk, where Jk is the 


covariance matrix of the observation noise e^, the log like¬ 
lihood logP(0|t/i,..., t/Af) is, up to a constant, 

N 

logP(0) -f ^logP(yfc|2/i,---,2//c-i,0) 

k^l 

N 

= logF{e) + '^\ogJ\f{yk;Hkm^,Sk). (40) 

k=l 

For prediction we can use the Rauch-Tung-Streibel 
smoother to obtain the means and covariances, 

ruk-N =F,{zk\yi,...,yN,Q), (41) 

Pk-N = E ((zfc - mk-N){zk - rnk-N)'^\yi, ..., j/at, 0) , 

(42) 

conditional on the training data and the inferred parame¬ 
ters 0. Note that the state space framework easily handles 
the missing (test) data by modifying the observation matrix 
Hk. 

5.2. Implementation 

In the case of our state space model (34) and (35), we as¬ 
sume that the smoothness v, and the number of latent noise 
processes R is chosen ahead of time. Hence our collected 
parameters 0 are: £i,... ,£p (length-scale parameters), L 
(a p X R matrix parameterizing the covariance across the 
observed processes), and rf,... ,Tp (variances of the ob¬ 
servation noise). Our inference involves two stages: 

1. Taking the state space form (26), (27), of each uni¬ 
variate Matern process Xj (f), we estimate the individ¬ 
ual length-scales ij one-by-one by minimizing (40). 
In practice, we found that Matlab’s fminunc() works 
well. 


2. Now using the state space form (34), (35) for the 
multi-output process, we sample from the posterior 
distribution of the remaining parameters L and r^, us¬ 
ing the Metropolis-Hastings algorithm. 

6. Experiments 

In this section, we use simulated and real data to evaluate 
our method. We compare our method to some existing al¬ 
gorithms in terms of prediction accuracy. Additionally, we 
show how our method describes correlations among multi¬ 
ple time series. 


Bij — 
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Figure 2. Wave and Tide data- The values of Sotonmet tide heights between the two vertical lines are assumed to be unknown. The black 
dots represent the true values of the Sotonmet tide heights, and the black lines show the predicted mean, with ±2 standard deviations 
shaded, using three choices of smoothness parameters v in our model. The standardized mean squared errors (SMSE) are provided for 
each option. 



Figure 1. Simulated time series: xi{t) — 0.2cos(57rf) —2f+0.1e 
and X 2 (t) — t — 0.5cos(57rf) + 0.04?7 for t £ [0,1]. The ob¬ 
served samples are shown with blue and black dots, respectively. 
The black line illustrates the Kalman-filter predicted means for 
the withheld samples. 

6.1. Synthetic data 

For our first experiment, we took a random selection of 100 
times t G [0,1], and simulated two time series 

xi(t) = 0.2cos(57rf) — 2t + O.le, 

X 2 (t) = t — 0.5cos(57rf) + O.OTry, 

where e and p are iid A/^(0,1) noises. These are shown 
as the blue and black dots, respectively, in Figure 1. We 
removed the last 41 observations of the second time se¬ 
ries (black dots), illustrated by the vertical dashed line, 
and treated them as the test set. The black line shows the 
Kalman filter predicted means for the withheld data, us¬ 
ing the last sampled parameters based on our model, and 
the gray area shows the given ±2cr deviations about the 
predicted mean for both series. On a 2011 Macbook with 
a 2.3Ghz i5 processor and 8GBs of RAM it took 65 sec¬ 
onds to draw 50,000 posterior samples of the correlation 
and noise parameters, while estimating the length scales 
and predicting the missing values is near-instantaneous. 


A highly optimized Kalman filter routine might lower the 
sampling time by an order of magnitude. 

6.2. Wave and Tide data 

For our second experiment, we tested our model on wave 
and tide data from the weather stations of Cambermet, 
Chimet, and Sotonmet, all on the southern coast of the 
U.K.^. The data consists of four time series: the tide 
heights of Chimet and Sotonmet, and the wave heights of 
Cambermet and Chimet. There are 288 observations, taken 
at 5 minute intervals, from the day of January 1, 2010. Ob¬ 
servations 150 to 250 of the Sotonmet tide heights (black 
dots) were removed to make a test set. 

With this data we investigated how different choices of the 
smoothness parameter w affected performance. All simu¬ 
lations used i? = 4 independent noise sources. In figure 
2 the black dots represent the true values of the Sotonmet 
tide heights, and the black line is the predicted mean, with 
±2 standard deviations shaded. 

With 12 — 1/2 the model overestimates the correlation 
between the two tide heights (p « 0.9), resulting in an 
overconfident estimate that tracks the other tide height (red 
dots) too closely. With z/ = 5/2 we have the opposite prob¬ 
lem: despite the two tide heights staying together over the 
course of the day, there is not much correlation found be¬ 
tween their third derivatives, resulting in a very weak pre¬ 
diction. The middle case of z/ = 3/2 strikes a nice balance, 
with estimated length-scales of (102.5,75.4,24.0,41.0), 
and inferred correlation matrix 

/l.OOOO 0.6155 0.0191 0.0655\ 

( 0.6155 1.0000 0.0547 0.0984 | 

0.0191 0.0547 1.0000 0.3344 

Vo. 0655 0.0984 0.3344 l.OOOo/ 

Note the moderate correlations (« 0.6) found between the 
tide heights, and the weak correlations (< 0.1) between the 

^Data available from http://www.chimet.co.uk 
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(a) CAD 



(b) JPY 



(c) AUD 


Figure 3. The US Dollar exchange rate with respect to Canadian Dollar (CAD), Japanese Yen (JPY), and Australian Dollar (AUD). The 
black dots show the observed data. The vertical lines show the intervals where the data are assumed to be unknown (i.e., test set). The 
solid lines show the predicted means using our model. The grey areas show the corresponding 95% intervals. 


tide and wave heights. 

6.3. Financial data 

For our last example, we consider the inference of missing 
data in the multivariate financial dataset used in (Alvarez 
et al., 2010). It contains thirteen time-series for the US Dol¬ 
lar exchange rate with respect to the top 10 international 
currencies (Canadian Dollar, Euro, Japanese Yen, Great 
British Pound, Swiss Franc, Australian Dollar, Hong Kong 
Dollar, New Zealand Dollar, South Korean Won, Mexican 
Peso), and three precious metals (gold, silver, platinum), 
over all 251 working days of the 2007 calendar year. Fol¬ 
lowing (Alvarez et al., 2010) we removed the mean and 
normalized each series to have unit variance, and removed 
a test set of 251 data points, covering days 50-100, 100- 
150, and 150-200, from the Canadian Dollar, Japanese Yen, 
and Australian Dollar series, respectively. The remaining 
3051 data points were used as the training set. (There are 
already 59 missing data points from the precious metal se¬ 
ries). These three time series are shown in Figure 3, along 
with the predicted means. As before, the vertical lines show 
the intervals where the test set data was withheld. 

Because of the roughness of the paths we modeled the 13 
time series as dependent Matern(^ = processes, and re¬ 
stricted the parameter space by allowing for only H = 4 
independent noise sources. The predicted means and the 
corresponding 95% intervals are shown as solid lines and 
shaded areas respectively in Figure 3. We then compared 
with the linear model of coregionalization (FMC) where 
the kernel (4) is a combination of two Matern(:/ ”1) 
nels, and and are both of rank 2. Our model’s 
predictions had a standardized mean squared error (SMSE) 
of 0.087 (averaged across the three test outputs), while the 
FMC scored 0.49. Note that in this case our model is es¬ 
sentially the Stochastic Fatent Force model in of (Alvarez 
et al., 2009), with all four latent processes as white noise. 


Nonetheless we end up with much better predictions (for 
their best model with one smooth and three white noise 
latent processes, (Alvarez et al., 2010) quote a SMSE of 
0.2795, and 0.39 for their FMC implementation). We be¬ 
lieve this shows the power of independently modelling the 
output processes and then their correlations. 

7. Discussion 

In this paper, we have proposed a new class of stochastic 
process models for multivariate time series. Using several 
examples, we illustrated our method’s predictive power and 
interpretability. However, as discussed above, our method 
is also designed to be extendable to problems with more 
complex structures. 

One possible extension to our model would be to allow ker¬ 
nels with (quasi-)periodic behavior, leading to better infer¬ 
ence when modeling periodic phenomena such as the wave 
and tide data of Section 6.2. This is indeed possible within 
the state space approach, as exemplifed by the stochastic 
resonator model (Solin & Sarkka, 2013; 2014) and the lin¬ 
ear basis model (Reece et al., 2014). 

Referring again to the wave and tide data seen in Figure 
2, one can see that the peaks and troughs are not perfectly 
aligned, either because of a delay in one of the sensor read¬ 
ings, or physical delay due to differing sensor locations. 
It should be possible to model this within the state space 
approach, allowing for more computationally efficient and 
interpretable versions of the Gaussian process sensor net¬ 
work model presented in (Osborne et al., 2012). 
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